ABSTRACT

Constructing effective representations is a critical but challenging
problem in multimedia understanding. Traditional handcrafted features
often rely on domain knowledge, which limits the performance of
existing methods. This paper discusses a novel computational
architecture for general image feature mining, which assembles
primitive filters (i.e. Gabor wavelets) into compositional features in
a layer-wise manner. In each layer, we produce a number of base
classifiers (i.e. regression stumps) associated with the generated
features, and discover informative compositions by using the boosting
algorithm. The output compositional features of each layer are treated
as the base components to build up the next layer. Our framework is
able to generate expressive image representations while inducing
highly discriminative functions for image classification. Experiments
are conducted on several public datasets, and we demonstrate superior
performance over state-of-the-art approaches.

Index Terms: Image Classification, Feature Mining, Hierarchical
Composition, Deep Learning

1. INTRODUCTION 

Feature engineering (i.e. constructing effective image
representations) has been actively studied in machine learning and
computer vision [?]. In the literature, the terms feature selection or
feature mining often refer to selecting a subset of relevant features
from a given feature space [?]. One typical feature selection method is
the Adaboost algorithm, which merges feature selection with the
learning procedure. According to previous work [?], Adaboost
constructs a pool of features (i.e. weak classifiers) and selects the
discriminative ones to form the final strong classifier. These
boosting-based approaches provide an effective way to perform image
classification and have achieved outstanding results in the past
decade.

Fig. 1: Illustration of layered feature mining for deep boosting.
Each patch on the bottom denotes an absolute position in the image.
Each layer of the deep boosting model except the last one comprises
two stages: feature selection and composition. In the feature
selection stage, the black circles indicate the visual primitive
candidates in each layer, and the selected features are marked in red.
In the composition stage, the compositional features are indicated by
triangles, each of which is the weighted linear combination of two
selected features in the lower layer. At the highest layer, we employ
all the final composite features to train the strong classifier that
predicts the class label of the query image.

Despite their acknowledged success, such boosting methods suffer from
two essential problems. First, the weak classifier selected at each
boosting step is limited by its own discriminative ability when faced
with complex classification problems. In order to decrease the
training error, the final classifier is formed as a linear combination
of a large number of weak classifiers through boosting [6]. Second, an
effective learning procedure often drives the training error toward
zero. However, with an unknown decision boundary, how to decrease the
test error when the training error is approaching zero is still an
open issue [?].

In recent decades, hierarchical models, also known as deep models [?],
have played a central role in the multimedia and computer vision
literature. Generally, such a hierarchical architecture represents
different layers of vision primitives such as pixels, edges, object
parts and so on. The basic principles of hierarchical models are
twofold: (1) a layer-wise learning philosophy, whose goal is to learn
each layer of the model individually and stack the layers to form the
final architecture; (2) feature combination rules, which use
combinations of features detected in a lower layer to construct more
expressive higher-layer features by introducing an activation
function. This line of research inspires us to employ such
compositional representations to construct expressive features with
more discriminative power. Different from previous works [?] that
apply hierarchical generative models, we address the problem of
general image classification directly and design the final classifier
to leverage both generalization and discrimination abilities.

This paper proposes a novel feature mining framework, namely deep
boosting, which aims to construct effective discriminative features
for the image classification task. In [?], the concept of 'mining'
refers to picking a subset of features as well as modeling the entire
feature space; in contrast, we use the word to describe the process of
feature selection and combination, which is more closely related to
[?]. For each layer, following the well-known boosting method [7], our
deep model sequentially selects visual features to learn a classifier
that reduces the training error. In order to construct high-level
discriminative representations, we compose the selected features in
the same layer and feed the compositions into the higher layer to
build a multilayer architecture. Another key to our approach is
introducing spatial information when combining individual features,
which makes the upper-layer representation more structured at the
local scale. The experiments show that our method achieves excellent
performance on the image classification task.

2. RELATED WORK 

In the past few decades, many works have focused on designing
different types of features to capture image characteristics, such as
color, SIFT and HoG [?]. Based on these feature descriptors, the
Bag-of-Features (BoF) model is perhaps the most classical image
representation method in computer vision and related multimedia
applications. Several promising studies [?] have been published to
improve this traditional approach in different aspects. Among these
extensions, a class of sparse-coding-based methods [?], which employ
the spatial pyramid matching (SPM) kernel proposed by Lazebnik et al.,
has achieved great success in image classification. Although more and
more effective representation methods are being developed, the lack of
high-level image expression still hinders building an ideal vision
system.

On the other hand, learning hierarchical models to simultaneously
construct multiple levels of visual representation has received much
attention recently [?]. Our deep boosting method is partially
motivated by recently developed deep learning techniques [?].
Different from previous handcrafted feature design methods, a deep
model learns the feature representation from raw data and effectively
generates high-level semantic representations. However, as shown in a
recent study [?], these network-based hierarchical models often
contain thousands of nodes in a single layer and are too complex to
control in real multimedia applications. In contrast, a distinguishing
characteristic of our study is that we build up the deep architecture
to generate expressive image representations in a simple way while
obtaining a near-optimal classification rate in each layer.

3. DEEP BOOSTING FOR IMAGE RECOGNITION 

3.1. Background: Gentle Adaboost 

We start with a brief review of the Gentle Adaboost algorithm [7].
Without loss of generality, consider the two-class classification
problem. Let $(x_1, y_1), \ldots, (x_N, y_N)$ be the training samples,
where $x_i$ is a feature representation of the sample and
$y_i \in \{-1, 1\}$, and let $w_i$ be the sample weight associated with
$x_i$. Gentle Adaboost [?] provides a simple additive model of the
form

$$F(x_i) = \sum_{m=1}^{M} f_m(x_i), \qquad (1)$$

where $f_m$ is called a weak classifier in the machine learning
literature. It is often defined as the regression stump
$f_m(x_i) = a\,h(x_i^d > \delta) + b$, where $h(\cdot)$ denotes the
indicator function, $x_i^d$ is the $d$-th dimension of the feature
vector $x_i$, $\delta$ is a threshold, and $a$ and $b$ are the two
parameters of the linear regression function. In iteration $m$, the
algorithm learns the parameters $(d, \delta, a, b)$ of $f_m(\cdot)$ by
weighted least-squares regression of $y_i$ on $x_i$ with weights $w_i$,

$$\min_{1 \le d \le D} \sum_{i=1}^{N} w_i \, \| f_m(x_i) - y_i \|^2, \qquad (2)$$

where $D$ is the dimension of the feature space. In order to pay more
attention to the samples that are misclassified in each round, Gentle
Adaboost adjusts the sample weights for the next iteration as
$w_i \leftarrow w_i e^{-y_i f_m(x_i)}$ and updates
$F(x_i) \leftarrow F(x_i) + f_m(x_i)$. Finally, the algorithm outputs
the strong classifier in the form of the sign function
$\mathrm{sign}[F(x_i)]$. Please refer to [?] for more details.
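
The review above can be sketched in a few lines of plain Python. This
is a minimal illustration under stated assumptions, not the authors'
implementation: the function names (`stump`, `fit_stump`,
`gentle_adaboost`, `predict`) are hypothetical, candidate thresholds
are taken from the observed feature values, and at least one feature
is assumed to take more than one value.

```python
import math

def stump(x, d, delta, a, b):
    """Regression stump f(x) = a * [x_d > delta] + b."""
    return a * (1.0 if x[d] > delta else 0.0) + b

def fit_stump(xs, ys, ws):
    """Weighted least-squares fit of (d, delta, a, b), as in Eq. (2)."""
    best = None
    for d in range(len(xs[0])):
        for delta in sorted({x[d] for x in xs}):
            hi = [(w, y) for x, y, w in zip(xs, ys, ws) if x[d] > delta]
            lo = [(w, y) for x, y, w in zip(xs, ys, ws) if x[d] <= delta]
            if not hi:
                continue  # lo is never empty: delta is an observed value
            # The optimum is the weighted mean of y on each side.
            b = sum(w * y for w, y in lo) / sum(w for w, _ in lo)
            a = sum(w * y for w, y in hi) / sum(w for w, _ in hi) - b
            err = sum(w * (stump(x, d, delta, a, b) - y) ** 2
                      for x, y, w in zip(xs, ys, ws))
            if best is None or err < best[0]:
                best = (err, d, delta, a, b)
    return best[1:]

def gentle_adaboost(xs, ys, rounds):
    """Return a list of stump parameters; F(x) is the sum of the stumps."""
    n = len(xs)
    ws = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        params = fit_stump(xs, ys, ws)
        model.append(params)
        # w_i <- w_i * exp(-y_i * f_m(x_i)), then renormalize.
        ws = [w * math.exp(-y * stump(x, *params))
              for x, y, w in zip(xs, ys, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]
    return model

def predict(model, x):
    """Strong classifier sign[F(x)]."""
    score = sum(stump(x, *params) for params in model)
    return 1 if score >= 0 else -1

# Toy data: the first feature separates the two classes.
xs = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]]
ys = [-1, -1, 1, 1]
model = gentle_adaboost(xs, ys, rounds=2)
predictions = [predict(model, x) for x in xs]
```

The exhaustive threshold search is quadratic in the number of samples
per feature; it is chosen here for readability rather than speed.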



Fig. 2: Illustration of Feature Combination. A cluster on the 
bottom denotes a set of different selected visual primitives 
(i.e. Gabor wavelet filters) at the same position in the image. 
A Gabor wavelet filter is denoted by an ellipse. At the second 
layer, a composite feature, which is combined by two Gabor 
wavelet filters, is fed into the third layer as an upper visual 
primitive. The intensity of every ellipse indicates the weight 
of Gabor wavelet filter. 


3.2. Preprocessing 

The basic units in the Gentle Adaboost algorithm are individual
features, also known as weak classifiers. Unlike the rectangle
features used in [?] for face detection, we employ Gabor wavelet
responses as the image feature representation. Let $I$ be an image
defined on the image lattice domain and $G$ be the Gabor wavelet
elements with parameters $(w, h, \alpha, s)$, where $(w, h)$ is the
central position belonging to the lattice domain, and $\alpha$ and $s$
denote the orientation and scale parameters. Following [18], we use a
normalization term to make the Gabor responses comparable between
different training images:

$$\sigma^2(s) = \frac{1}{|P|\,A} \sum_{\alpha} \sum_{w,h} |\langle I, G_{w,h,\alpha,s} \rangle|^2, \qquad (3)$$

where $|P|$ is the total number of pixels in image $I$, and $A$ is the
number of orientations. $\langle \cdot \rangle$ denotes the
convolution process. For each image $I$, we normalize the local energy
as $|\langle I, G_{w,h,\alpha,s} \rangle|^2 / \sigma^2(s)$ and define
the positive square root of this normalized result as the feature
response. In practice, we resize each image to 120 × 120 pixels and
apply one scale and eight orientations in our implementation, so there
are 120 × 120 × 1 × 8 filter responses in total for each grayscale
image.
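
As a small worked example of the normalization step, the sketch below
assumes the squared filter responses for one scale are already
available as a nested list (one row per orientation, one entry per
pixel) and that not all of them are zero. The name
`normalize_responses` and the constant `FEATURE_DIM` are hypothetical.

```python
import math

# Number of first-layer responses: 120 x 120 positions, 1 scale, 8 orientations.
FEATURE_DIM = 120 * 120 * 1 * 8

def normalize_responses(energy):
    """energy[a][p] holds the squared response for orientation a at pixel p."""
    num_orientations = len(energy)
    num_pixels = len(energy[0])
    # Mean local energy over all pixels and orientations, as in Eq. (3).
    sigma2 = sum(sum(row) for row in energy) / (num_pixels * num_orientations)
    # Feature response: positive square root of the normalized energy.
    return [[math.sqrt(e / sigma2) for e in row] for row in energy]

toy_energy = [[1.0, 3.0], [4.0, 0.0]]  # 2 orientations, 2 pixels
normalized = normalize_responses(toy_energy)
```

After normalization the mean squared response is one for every image,
which is what makes responses comparable across training images.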

3.3. Discriminative Feature Selection 

In this subsection, we establish the relationship between the weak
classifier and the Gabor wavelet representation. After the Gabor
responses are calculated, we learn the classification function using
the given feature set and a training set that includes both positive
and negative images. Suppose the size of the training set is $N$. In
our deep boosting system, the weak learning method selects the single
feature (i.e. weak classifier) that best separates the positive and
negative samples. To fix the notation, let $x_i \in \mathbb{R}^D$ be
the feature representation of image $I_i$, where $D$ is the dimension
of the feature space. In the first layer,
$D = 120 \times 120 \times 1 \times 8$, corresponding to the Gabor
wavelets in Sec. 3.2. Specifically, each element of $x_i$ is a
particular Gabor response of image $I_i$ (in the first layer) or a
composition of such responses (in other layers). Note that in the rest
of the paper, we use $x_i^d$ to denote the value of $x_i$ in the $d$-th
dimension. In each round of the feature selection procedure, instead
of using the indicator function in Eq. (2), we introduce the sigmoid
function defined by

$$\phi(x) = 1 / (1 + e^{-x}). \qquad (4)$$

In this way, we consider a collection of regression functions
$\{f^1, f^2, \ldots, f^D\}$, where each $f^d$ is a candidate weak
classifier whose definition is given in Definition 1.

Definition 1 (Discriminative Feature Selection)

In each round, the algorithm retrieves all of the candidate
regression functions, each of which is formulated as

$$f^d(x_i) = a\,\phi(x_i^d - \delta) + b, \qquad (5)$$

where $\phi(\cdot)$ is the sigmoid function defined in Eq. (4). The
candidate function with the current minimum training error is selected
as the current weak classifier $f$, such that

$$\min_{1 \le d \le D} \sum_{i=1}^{N} w_i \, \| f^d(x_i) - y_i \|^2, \qquad (6)$$

where $f^d(x_i)$ is associated with the $d$-th element of $x_i$ and the
function parameters $(\delta, a, b)$.

According to the above discussion, we build a bridge between the weak
classifier and a particular Gabor wavelet (or a composition of
wavelets). Learning the weak classifiers can therefore be viewed as
the feature selection procedure in our deep boosting model.
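
One way to carry out the selection in Definition 1 is sketched below.
It assumes a fixed, caller-supplied grid of candidate thresholds (the
paper does not state how $\delta$ is searched), and for each pair
$(d, \delta)$ it solves for $a$ and $b$ in closed form by weighted
linear regression on the sigmoid-transformed feature. The names
`fit_ab`, `select_feature` and `thresholds` are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_ab(zs, ys, ws):
    """Weighted least squares for a * z + b, with z = sigmoid(x_d - delta)."""
    total = sum(ws)
    z_bar = sum(w * z for w, z in zip(ws, zs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    var = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
    if var == 0.0:
        return 0.0, y_bar  # constant feature: best fit is the weighted mean
    cov = sum(w * (z - z_bar) * (y - y_bar) for w, z, y in zip(ws, zs, ys))
    a = cov / var
    return a, y_bar - a * z_bar

def select_feature(xs, ys, ws, thresholds):
    """Return (error, d, delta, a, b) minimizing the weighted error of Eq. (6)."""
    best = None
    for d in range(len(xs[0])):
        for delta in thresholds:
            zs = [sigmoid(x[d] - delta) for x in xs]
            a, b = fit_ab(zs, ys, ws)
            err = sum(w * (a * z + b - y) ** 2 for w, z, y in zip(ws, zs, ys))
            if best is None or err < best[0]:
                best = (err, d, delta, a, b)
    return best

# Toy data: feature 0 is constant, feature 1 separates the classes.
xs = [[0.5, 0.0], [0.5, 0.2], [0.5, 2.0], [0.5, 2.2]]
ys = [-1, -1, 1, 1]
ws = [0.25] * 4
best = select_feature(xs, ys, ws, thresholds=[0.5, 1.0, 1.5])
```

Because the sigmoid is smooth, the fitted function responds gradually
around the threshold instead of switching abruptly as the indicator in
Eq. (2) does.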

3.4. Composite Feature Construction 

The classification accuracy based on an individual feature or a single
weak classifier is usually low, and the strong classifier, which is a
weighted linear combination of weak classifiers, can hardly decrease
the test error once the training error approaches zero. It is
therefore of interest to improve the discriminative ability of the
features and to learn high-level representations as well.

In order to achieve this goal, we introduce the feature combination
strategy in Definition 2. All features selected in the feature
selection stage are combined in a pairwise manner under spatial
constraints, and the output composite features of each layer are
treated as base components to construct the next layer.


Definition 2 (Feature Combination Rule)

For each image $I$, whose feature representation is denoted by $x$, we
combine two selected features in a local area as

$$[x^j]_{l+1} = \beta_s [x^s]_l + \beta_t [x^t]_l, \quad \exists\, s, t \in \Omega(j), \qquad (7)$$

where $[x^s]_l$ and $[x^t]_l$ indicate the $s$-th and $t$-th feature
responses corresponding to the image $I$ in layer $l$.

As illustrated in Fig. 1, $x^s$ and $x^t$ are the response values of
selected features, which are indicated by the red circles in each
layer. $\beta_s$ and $\beta_t$ are the combination weights,
proportional to the training error rates of the $s$-th and $t$-th weak
classifiers calculated over the training set. $\Omega(j)$ is the local
area determined by the projected coordinate of the composite feature
$j$ on the normalized image (i.e. the image of size 120 × 120 pixels
in practice). In a higher layer, the feature selection process is the
same as in the lower layer, and can be formulated as Eq. (6). Please
refer to Fig. 2 for more details about feature combination.
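
The combination rule can be illustrated with the following sketch. It
rests on two assumptions that the text does not fix: the spatial
constraint is read as "both features fit inside one 3 × 3 block", and
the composite feature is placed at the midpoint of its two parents.
The function name `compose_layer` and the tuple layout are
hypothetical.

```python
from itertools import combinations

def compose_layer(selected, block=3):
    """Pairwise composition of selected features for one image.

    selected: list of (w, h, beta, response) tuples, where (w, h) is the
    position, beta the combination weight and response the feature value.
    """
    composites = []
    for (w1, h1, b1, r1), (w2, h2, b2, r2) in combinations(selected, 2):
        # Assumed spatial constraint: both parents fit in one block x block window.
        if abs(w1 - w2) < block and abs(h1 - h2) < block:
            position = ((w1 + w2) / 2.0, (h1 + h2) / 2.0)  # assumed: midpoint
            # Weighted linear combination of the two responses, as in Eq. (7).
            composites.append((position, b1 * r1 + b2 * r2))
    return composites

selected = [
    (10, 10, 0.6, 0.5),
    (11, 12, 0.4, 1.0),
    (50, 50, 0.5, 0.2),  # too far from the others to be combined
]
composites = compose_layer(selected)
```

Only the first two features are close enough to be paired, so the toy
input yields a single composite feature.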

Integrating the two stages in Sec. 3.3 and Sec. 3.4, we build up a
single layer of our model. We then stack such layers to form the final
deep boosting architecture, which consists of multiple layers. Our
overall feature mining algorithm is summarized in Algorithm 1.


Algorithm 1 Deep Boosting for Feature Mining

Input:

Positive and negative training samples
$(x_1, y_1), \ldots, (x_N, y_N)$, the number of selected features
$M_l$ in layer $l$, and the total number of layers $L$.

Output:

A pool of generated features and the final classifier $F^L(x)$ for a
given category.

Repeat for $l = 1, 2, \ldots, L$:

1. Start with score $F^l(x) = 0$ for layer $l$ and sample weights
   $w_i = 1/N$, $i = 1, 2, \ldots, N$.

2. Select features and learn the strong classifier for layer $l$ as
   follows:

   Repeat for $m = 1, 2, \ldots, M_l$:

   (a) Learn the current weak classifier $f_m$ by Eq. (6).

   (b) Update $w_i \leftarrow w_i e^{-y_i f_m(x_i)}$ and renormalize.

   (c) Update $F^l(x) \leftarrow F^l(x) + f_m(x)$.

3. Update $x_i$ by $f_m(x)$, $m = 1, 2, \ldots, M_l$.

4. Generate the composite features according to Eq. (7).
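
The outer loop of Algorithm 1 can be sketched as a small driver. The
selection and composition stages are passed in as callables so that
the sketch stays independent of how they are implemented; `select_fn`,
`compose_fn` and `deep_boosting` are hypothetical names. Following the
caption of Fig. 1, the sketch assumes that the last layer performs
selection only.

```python
def deep_boosting(features, labels, layer_sizes, select_fn, compose_fn):
    """Layer-wise feature mining loop.

    features[i] is the feature vector of sample i, labels[i] is in {-1, +1},
    and layer_sizes[l] is the number of features selected in layer l.
    select_fn(features, labels, m) returns (selected_indices, strong_classifier).
    compose_fn(features, selected_indices) returns the next layer's features.
    """
    pool = []
    strong = None
    for depth, num_selected in enumerate(layer_sizes):
        selected, strong = select_fn(features, labels, num_selected)
        pool.append(selected)
        if depth < len(layer_sizes) - 1:
            # Composite features become the candidates of the next layer.
            features = compose_fn(features, selected)
    return pool, strong

# Trivial stand-ins, only to show the data flow between layers.
def first_m(features, labels, m):
    return list(range(min(m, len(features[0])))), (lambda x: 0.0)

def pairwise_sums(features, selected):
    return [[row[s] + row[t] for s in selected for t in selected if s < t]
            for row in features]

pool, strong = deep_boosting(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [1, -1], [3, 2], first_m, pairwise_sums
)
```

In a real run, `select_fn` would wrap the boosting rounds of step 2 and
`compose_fn` the combination rule of Eq. (7).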


3.5. Multi-class Decision 

We employ the naive one-against-all strategy to handle the multi-class
classification task in this paper. Given the training data
$\{(x_i, y_i)\}_{i=1}^{N}$, $y_i \in \{1, 2, \ldots, K\}$, we train $K$
binary strong classifiers, each of which returns a classification
score for a given test image. In the testing phase, we predict the
label of the image as the class whose classifier gives the maximum
score.
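
The one-against-all decision is short enough to write out. In this
sketch each strong classifier is represented by a function that maps a
feature vector to a real-valued score; `binary_labels` and
`predict_label` are hypothetical helper names.

```python
def binary_labels(labels, target):
    """Relabel a K-class problem as target (+1) versus the rest (-1)."""
    return [1 if y == target else -1 for y in labels]

def predict_label(classifiers, x):
    """Return the class whose strong classifier gives the highest score."""
    scores = {label: score_fn(x) for label, score_fn in classifiers.items()}
    return max(scores, key=scores.get)

# Toy scorers standing in for three trained strong classifiers.
classifiers = {
    1: lambda x: x[0],
    2: lambda x: x[1],
    3: lambda x: -1.0,
}
label = predict_label(classifiers, [0.2, 0.9])
```
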

Fig. 3: Visualizations of Deep Boosting. (a) Original image;
(b) Visualizations of the 1st layer; (c) Visualizations of the
2nd layer; (d) Visualizations of the 3rd layer. Elliptical bars
in each figure denote Gabor wavelets, and the shade of color
shows the corresponding weight.


4. EXPERIMENT 

4.1. Dataset and Experiment Setting 

We apply the proposed method to the general classification task, using
the Caltech 256 Dataset [19] and the 15 Scenes Dataset [12] for
validation. For both datasets, we split the data into training and
test sets, use the training set to discover the discriminative
features and learn the strong classifiers, and use the test set to
evaluate classification performance.

Table 1: Classification Rate (%) on Caltech256 Class Sets - Easy10.

| Method | desk-globe | mars | sheet-music | sunflower | tower-pisa | trilobite | watch | zebra | car-side | face-easy | AVERAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ScSPM [?] | 92.31 | 88.10 | 87.04 | 96.10 | 100.0 | 81.25 | 90.06 | 93.90 | 100.0 | 100.0 | 92.87 |
| LLC [?] | 92.30 | 88.09 | 81.48 | 100.0 | 96.66 | 93.75 | 92.98 | 90.90 | 100.0 | 100.0 | 93.61 |
| HoG+SVM [?] | 89.09 | 81.45 | 64.16 | 85.50 | 71.66 | 80.29 | 84.75 | 77.20 | 99.28 | 98.24 | 83.16 |
| Ours | 100.0 | 93.75 | 91.66 | 100.0 | 96.66 | 97.05 | 82.97 | 88.90 | 100.0 | 98.93 | 94.99 |

Table 2: Classification Rate (%) on Caltech256 Class Sets - Var10.

| Method | bear | billiards | blimp | hamburger | hummingbird | laptop | minotaur | roulette | skyscraper | yo-yo | AVERAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ScSPM [?] | 80.55 | 78.22 | 64.28 | 76.78 | 62.79 | 63.26 | 59.61 | 62.26 | 87.69 | 65.71 | 70.11 |
| LLC [?] | 79.16 | 74.19 | 69.64 | 78.57 | 67.44 | 70.40 | 73.07 | 56.60 | 86.15 | 64.28 | 71.95 |
| HoG+SVM [?] | 88.80 | 74.49 | 74.23 | 81.15 | 84.46 | 83.23 | 79.09 | 69.13 | 79.99 | 65.24 | 77.98 |
| Ours | 88.09 | 62.38 | 92.30 | 96.15 | 83.92 | 80.88 | 81.81 | 100.0 | 91.42 | 77.50 | 85.45 |

As mentioned in Sec. 3.2, for both datasets we resize each image to
120 × 120 pixels and simply set the Gabor wavelets to one scale and
eight orientations. In each layer, the strong classifier is trained in
a supervised manner, and the numbers of selected features are set to
1000, 800 and 500, respectively. We densely combine the selected
features within 3 × 3 blocks and obtain 3000 to 8000 composite
features in every layer. According to the experiments, the number of
composite features in each layer depends strongly on the complexity of
the image content. The visualization of the feature map in each layer
is shown in Fig. 3.

We carry out the experiments on a PC with a Core i7-3960X 3.30 GHz CPU
and 24 GB of memory. On average, it takes 5 to 9 hours to train the
model for one category, depending on the number of training examples
and the complexity of the image content. The time cost for recognizing
an image is around 25 to 40 seconds.

4.2. Experiment I: Caltech 256 Dataset 

We evaluate the performance of our deep boosting algorithm on the
Caltech 256 Dataset, which is widely used as a benchmark for general
image classification [?]. The Caltech 256 Dataset contains 30607
images in 256 categories. We consider the image classification problem
on the Easy10 and Var10 image sets according to [20]. We evaluate
classification results over 10 random splits of the training and
testing data (i.e. 60 training images and the rest as testing images)
and report the performance using the mean of the per-class
classification rates. Besides our own implementations, we also use
released Matlab code from previously published work [?] in our
experiments. As Table 1 and Table 2 report, our method reaches
classification rates of 94.99% and 85.45% on the Easy10 and Var10
datasets, outperforming the other approaches [?].
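
As a quick arithmetic check of the reporting convention, the AVERAGE
entry for our method in Table 1 can be reproduced as the plain mean of
the ten per-class rates. The dictionary name `ours_easy10` is
hypothetical; the numbers are copied from the "Ours" row of Table 1.

```python
# Per-class classification rates (%) of the "Ours" row in Table 1.
ours_easy10 = {
    "desk-globe": 100.0,
    "mars": 93.75,
    "sheet-music": 91.66,
    "sunflower": 100.0,
    "tower-pisa": 96.66,
    "trilobite": 97.05,
    "watch": 82.97,
    "zebra": 88.90,
    "car-side": 100.0,
    "face-easy": 98.93,
}

# Mean of the per-class rates, the figure reported in the AVERAGE column.
average = sum(ours_easy10.values()) / len(ours_easy10)
print(round(average, 2))  # 94.99
```
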

4.3. Experiment II: 15 Scenes Dataset 

We also test our method on the 15 Scenes Dataset [12]. This dataset
includes 4485 images in total, collected from 15 representative scene
categories. Each category contains at least 200 images. The categories
vary from mountain and forest to office and living room. Following the
standard benchmark procedure in [?], we select 100 images per class
for training and use the others for testing. The performance is
evaluated by randomly drawing the training and testing images 10
times. The mean and standard deviation of the recognition rates are
shown in Table 3. In this experiment, our deep boosting method also
achieves better performance than previous works [?]. Note that,
instead of HoG+SVM, we compare our approach with the GIST+SVM method
in this experiment, due to the effectiveness of GIST [?] in the scene
classification task. Because of subtle engineering details, we could
hardly achieve the desired results when applying the methods of [14]
and [13] in our own implementations. We therefore quote the reported
result directly from [?] and drop [?] from the comparison. We also
compare the recognition rates obtained with the strong classifiers of
different layers; the results for the top five categories of the 15
Scenes Dataset are reported in Fig. 4. The results indicate that our
proposed feature combination strategy improves the performance
effectively.
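
The evaluation protocol described above (a fixed number of training
images per class, repeated random splits, mean and standard deviation
of the rates) can be sketched as follows. The helper names
`split_per_class` and `summarize` are hypothetical, and the use of the
sample standard deviation is an assumption, since the text does not
say which estimator is reported.

```python
import random
import statistics

def split_per_class(images_by_class, n_train, rng):
    """Randomly pick n_train images per class for training; the rest are test."""
    train, test = {}, {}
    for label, images in images_by_class.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test

def summarize(rates):
    """Mean and sample standard deviation of the rates over repeated splits."""
    return statistics.mean(rates), statistics.stdev(rates)

rng = random.Random(0)  # fixed seed so the toy split is repeatable
toy_data = {"forest": list(range(5)), "office": list(range(7))}
train, test = split_per_class(toy_data, n_train=3, rng=rng)
mean_rate, std_rate = summarize([80.0, 82.0, 84.0])
```
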

Table 3: Classification Rate (%) on 15 Scenes Dataset.

| Algorithm | Mean Average Precision |
|---|---|
| ScSPM [?] | 80.28 ± 0.93 |
| GIST+SVM [?] | 75.12 ± 1.27 |
| Ours | 81.76 ± 0.97 |


Fig. 4: Classification accuracy of our proposed deep boosting method
when applying each layer's strong classifier. We report results for
the top five categories in the 15 Scenes Dataset.

5. CONCLUSION 

This paper studies a novel layered feature mining framework named deep
boosting. Following the boosting algorithm, the model sequentially
selects visual features in each layer and composes the selected
features of the same layer as the input to the upper layer, thereby
constructing a hierarchical architecture. Our approach achieves strong
results on several image classification tasks. Moreover, the
philosophy of such a deep model is very general and can be applied to
other multimedia applications.